Gradient boosting machines (GBMs) are an extremely popular machine learning algorithm that has proven successful across many domains and is one of the leading methods for winning Kaggle competitions. Whereas random forests (Chapter @ref(random_forest)) build an ensemble of deep independent trees, GBMs build an ensemble of shallow and weak successive trees, with each tree learning from and improving on the previous one. When combined, these many weak successive trees produce a powerful “committee” that is often hard to beat with other algorithms. This chapter covers the fundamentals needed to understand and implement GBMs.
This chapter leverages the following packages. Some of these packages play a supporting role; however, our focus is on demonstrating how to implement GBMs with the gbm [@R-gbm], xgboost [@R-xgboost], and h2o packages and discussing the pros and cons of each.
library(rsample) # data splitting
library(gbm) # original implementation of gbm
library(xgboost) # a faster implementation of gbm
library(h2o) # a java-based platform
library(vip) # visualize feature importance
library(pdp) # visualize feature effects
library(ggplot2) # model visualization
Advantages:
* Often provides predictive accuracy that cannot be beat.
* Lots of flexibility - can optimize on different loss functions and provides several hyperparameter tuning options that make the function fit very flexible.
* No data pre-processing required - often works great with categorical and numerical values as is.
* Handles missing data - imputation not required.
Disadvantages:
* GBMs will continue improving to minimize all errors. This can overemphasize outliers and cause overfitting. Must use cross-validation to neutralize.
* Computationally expensive - GBMs often require many trees (>1,000) which can be time and memory exhaustive.
* The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning.
* Less interpretable, although this is easily addressed with various tools (variable importance, partial dependence plots, local variable importance, etc.).
Several supervised machine learning models are founded on a single predictive model, such as linear regression, penalized models, naive Bayes, and support vector machines. Alternatively, other approaches such as bagging and random forests are built on the idea of constructing an ensemble of models where each individual model predicts the outcome and the ensemble simply averages the predicted values. The family of boosting methods is based on a different, constructive strategy of ensemble formation.
The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far.
Sequential ensemble approach.
Let’s discuss each component of the previous sequence in closer detail because they are important.
Base-learning models: Boosting is a framework that iteratively improves any weak learning model. Many gradient boosting applications allow you to “plug in” various classes of weak learners at your disposal. In practice, however, boosted algorithms almost always use decision trees as the base-learner. Consequently, this chapter will discuss boosting in the context of decision trees.
Training weak models: A weak model is one whose error rate is only slightly better than random guessing. The idea behind boosting is that each sequential model builds a simple weak model to slightly improve on the remaining errors. With regards to decision trees, shallow trees represent a weak learner. Commonly, trees with only 1-6 splits are used. Combining many weak models (versus strong ones) has a few benefits:
Sequential training with respect to errors: Boosted trees are grown sequentially; each tree is grown using information from previously grown trees. The basic algorithm for boosted regression trees can be generalized to the following where x represents our features and y represents our response:

1. Fit a decision tree to the data: \(F_1(x) = y\),
2. We then fit the next decision tree to the residuals of the previous: \(h_1(x) = y - F_1(x)\),
3. Add this new tree to our algorithm: \(F_2(x) = F_1(x) + h_1(x)\),
4. Fit the next decision tree to the residuals of \(F_2\): \(h_2(x) = y - F_2(x)\),
5. Add this new tree to our algorithm: \(F_3(x) = F_2(x) + h_2(x)\),
6. Continue this process until some mechanism (i.e. cross validation) tells us to stop.
The basic algorithm for boosted decision trees can be generalized to the following where the final model is simply a stagewise additive model of b individual trees:
\[ f(x) = \sum^B_{b=1}f^b(x) \tag{1} \]
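To make this stagewise additive idea concrete, the following is a minimal from-scratch sketch in base R (illustrative only; gbm's actual implementation is far more sophisticated). Each weak learner is a single-split stump fit to the current residuals, and its shrunken predictions are added to the running ensemble:

```r
# from-scratch boosting sketch: stumps fit to residuals (illustrative only)
set.seed(123)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)  # true function plus irreducible error

# fit a one-split regression stump to residuals r
fit_stump <- function(x, r) {
  splits <- sort(unique(x))[-1]  # candidate split points
  sse <- sapply(splits, function(s) {
    left <- r[x < s]; right <- r[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
predict_stump <- function(stump, x) {
  ifelse(x < stump$split, stump$left, stump$right)
}

shrinkage <- 0.1
f_hat <- rep(mean(y), length(y))  # initial prediction: the mean
mse_initial <- mean((y - f_hat)^2)
for (b in 1:100) {
  r <- y - f_hat                  # residuals of the ensemble so far
  stump <- fit_stump(x, r)
  f_hat <- f_hat + shrinkage * predict_stump(stump, x)  # stagewise additive update
}
mse_final <- mean((y - f_hat)^2)  # training MSE shrinks as stumps are added
```

Each iteration takes a small step toward the current residuals, so the ensemble slowly bends toward the underlying function, mirroring the stagewise sum in Equation (1).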
To illustrate the behavior, assume the following x and y observations. The blue sine wave represents the true underlying function, and the points represent observations that include some irreducible error (noise). The boosted prediction illustrates the adjusted predictions after each additional sequential tree is added to the algorithm. Initially, there are large errors, which the boosted algorithm improves upon immediately, but as the predictions get closer to the true underlying function you see each additional tree make small improvements in different areas across the feature space where errors remain. Towards the end of the gif, the predicted values nearly converge to the true underlying function.
Boosted regression tree predictions (courtesy of Brandon Greenwell)
Many algorithms, including decision trees, focus on minimizing the residuals and, therefore, emphasize the MSE loss function. The algorithm discussed in the previous section outlines the approach of sequentially fitting regression trees to minimize the errors. This specific approach is how gradient boosting minimizes the mean squared error (MSE). However, we often wish to optimize other loss functions, such as deviance for classification problems. The name gradient boosting machine comes from the fact that this procedure can be generalized to loss functions other than MSE.
Gradient boosting is considered a gradient descent algorithm. Gradient descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient descent is to tweak parameters iteratively in order to minimize a cost function. Suppose you are a downhill skier racing your friend. A good strategy to beat your friend to the bottom is to take the path with the steepest slope. This is exactly what gradient descent does - it measures the local gradient of the cost function and takes steps in the direction of the descending gradient. Once the gradient is zero, we have reached the minimum.
Gradient descent is the process of gradually decreasing the cost function (i.e. MSE) by tweaking parameters iteratively until you have reached a minimum. Image courtesy of @geron2017hands.
Gradient descent can be performed on any loss function that is differentiable. Consequently, this allows GBMs to optimize different loss functions as desired (see @esl, p. 360 for common loss functions). An important parameter in gradient descent is the size of the steps, which is determined by the learning rate. If the learning rate is too small, the algorithm will take many iterations to find the minimum. On the other hand, if the learning rate is too large, you might jump across the minimum and end up further away than when you started.
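A tiny base-R illustration (hypothetical values, unrelated to any package API) of how the learning rate affects gradient descent on a simple convex cost:

```r
# gradient descent on a simple convex cost, illustrating the learning rate
cost <- function(theta) (theta - 3)^2      # minimized at theta = 3
grad <- function(theta) 2 * (theta - 3)    # derivative of the cost

descend <- function(lr, n_iter = 50, theta = 0) {
  for (i in seq_len(n_iter)) {
    theta <- theta - lr * grad(theta)      # step in the descending direction
  }
  theta
}

descend(lr = 0.1)    # converges very close to the minimum at 3
descend(lr = 0.001)  # too small: still far from 3 after 50 steps
descend(lr = 1.1)    # too large: each step overshoots and diverges
```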
A learning rate that is too small will require many iterations to find the minimum. A learning rate too big may jump over the minimum. Image courtesy of @geron2017hands.
Moreover, not all cost functions are convex (bowl shaped). There may be local minima, plateaus, and other irregular terrain in the loss function that makes finding the global minimum difficult. Stochastic gradient descent can help us address this problem by sampling a fraction of the training observations (typically without replacement) and growing the next tree using that subsample. This makes the algorithm faster, but the stochastic nature of random sampling also adds some randomness in descending the loss function gradient. Although this randomness does not allow the algorithm to find the absolute global minimum, it can actually help the algorithm jump out of local minima and off plateaus and get near the global minimum.
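The randomness that subsampling injects can be seen in a toy base-R sketch (hypothetical, not tied to any package API): here we minimize a mean squared error by computing the gradient on a random minibatch at each step, analogous to growing each tree on a bag.fraction subsample:

```r
# stochastic (minibatch) gradient descent sketch: estimate the mean of y,
# which minimizes mean((y - theta)^2), using random subsamples of the data
set.seed(123)
y <- rnorm(10000, mean = 5)

theta <- 0
lr <- 0.1
for (i in 1:500) {
  batch <- sample(y, 32)                           # random subsample each step
  theta <- theta - lr * mean(2 * (theta - batch))  # gradient of MSE on the batch
}
theta  # bounces around, but settles near the true minimum at 5
```

Each step's gradient is noisy, so the path down the loss surface jitters; that jitter is what can bump the algorithm out of local minima and off plateaus.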
Stochastic gradient descent will often find a near-optimal solution by jumping out of local minima and off plateaus. Image courtesy of @geron2017hands.
As we’ll see in the next section, there are several hyperparameter tuning options that allow us to address how we approach the gradient descent of our loss function.
Part of the beauty and challenge of GBMs is that they offer several tuning parameters. The beauty in this is that GBMs are highly flexible. The challenge is that they can be time consuming to tune and to find the optimal combination of hyperparameters. The most common hyperparameters that you will find in most GBM implementations include:
* Number of trees: the total number of trees in the ensemble.
* Depth of trees: the number of splits in each tree, which controls the complexity of the boosted ensemble.
* Learning rate (shrinkage): how quickly the algorithm proceeds down the gradient descent.
* Subsampling: the fraction of the training data used to grow each tree, which introduces stochastic gradient descent.
Throughout this chapter you’ll be exposed to additional hyperparameters that are specific to certain packages and can improve performance and/or the efficiency of training and tuning models.
There are many packages that implement GBMs and GBM variants. You can find a fairly comprehensive list here and at the CRAN Machine Learning Task View. However, the most popular implementations, which we will cover in this chapter, include:
* gbm: the original R implementation of GBMs
* xgboost: a fast and efficient implementation
* h2o: a java-based platform
To illustrate various GBM concepts for a regression problem, we will continue with the Ames, IA housing data, where our intention is to predict Sale_Price.
# create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
library(AmesHousing)
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7, strata = "Sale_Price")
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
Notes Tree-based methods tend to perform well on unprocessed data (i.e., without normalizing, centering, or scaling features). In this chapter, I focus on how to implement GBMs with various packages. Although I do not pre-process the data, realize that you can improve model performance by spending time processing variable attributes.
gbm has two primary training functions - gbm::gbm and gbm::gbm.fit. The primary difference is that gbm::gbm uses the formula interface to specify your model whereas gbm::gbm.fit requires separated x and y matrices. When working with many variables it is more efficient to use the matrix rather than formula interface.
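As a quick sketch of the two interfaces (using a small simulated data set rather than the Ames data, so the column names here are arbitrary):

```r
library(gbm)

# small simulated regression data set (hypothetical, for illustration only)
set.seed(123)
df <- data.frame(x1 = runif(100), x2 = runif(100))
df$y <- 2 * df$x1 + rnorm(100, sd = 0.1)

# formula interface
fit1 <- gbm(y ~ ., distribution = "gaussian", data = df, n.trees = 50)

# matrix-style interface: separate x and y
fit2 <- gbm.fit(
  x = df[c("x1", "x2")],
  y = df$y,
  distribution = "gaussian",
  n.trees = 50,
  verbose = FALSE
)
```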
The default settings in gbm include a learning rate (shrinkage) of 0.001. This is a very small learning rate and typically requires a large number of trees to find the minimum MSE. However, gbm uses a default of 100 trees, which is rarely sufficient. Consequently, I crank it up to 5,000 trees. The default depth of each tree (interaction.depth) is 1, which means we are ensembling a bunch of stumps. Lastly, I also include cv.folds to perform a 5-fold cross validation.
Notes The model took 48 seconds to run and the results show that our MSE loss function is minimized with 5,000 trees.
# for reproducibility
set.seed(123)
# train GBM model
gbm.fit <- gbm(
formula = Sale_Price~.,
distribution = "gaussian",
data = ames_train,
n.trees = 5000,
interaction.depth = 1,
shrinkage = 0.001,
cv.folds = 5,
n.cores = NULL, #will use all cores by default
verbose = FALSE
)
# print results
print(gbm.fit)
## gbm(formula = Sale_Price ~ ., distribution = "gaussian", data = ames_train,
## n.trees = 5000, interaction.depth = 1, shrinkage = 0.001,
## cv.folds = 5, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## The best cross-validation iteration was 5000.
## There were 80 predictors of which 27 had non-zero influence.
The output object is a list containing several pieces of modeling and results information. We can access this information with regular indexing; I recommend you take some time to dig around in the object to get comfortable with its components. Here, we see that the minimum CV RMSE is $33,079.61, but the plot also illustrates that the CV error (MSE) is still decreasing at 5,000 trees.
# get MSE and compute RMSE
gbm.fit$cv.error %>% min() %>% sqrt()
## [1] 33079.61
# plot loss function as a result of n trees added to the ensemble
gbm.perf(gbm.fit, method = "cv")
Training and cross-validated MSE as n trees are added to the GBM algorithm.
In this case, the small learning rate is resulting in very small incremental improvements which means many trees are required. In fact, with the default learning rate and tree depth settings, the CV error is still reducing after 10,000 trees!
However, rarely do the default settings suffice. We could tune parameters one at a time to see how the results change. For example, here, I increase the learning rate to take larger steps down the gradient descent, reduce the number of trees (since we are increasing the learning rate, fewer trees are required), and increase the depth of each tree from a single split to 3 splits. Our RMSE ($23,713.34) is lower than our initial model's, and the optimal number of trees required was 964.
# for reproducibility
set.seed(123)
# train GBM model
gbm.fit2 <- gbm(
formula = Sale_Price ~ .,
distribution = "gaussian",
data = ames_train,
n.trees = 5000,
interaction.depth = 3,
shrinkage = 0.1,
cv.folds = 5,
n.cores = NULL, #will use all cores by default
verbose = FALSE
)
# find index for n trees with minimum CV error
min_MSE <- which.min(gbm.fit2$cv.error)
# get MSE and compute RMSE
gbm.fit2$cv.error[min_MSE] %>% sqrt()
# plot loss function as a result of n trees added to the ensemble
gbm.perf(gbm.fit2, method = "cv")
Training and cross-validated MSE as n trees are added to the GBM algorithm. We can see that our new hyperparameter settings result in a much quicker progression down the gradient descent than our initial model.
However, a better option than manually tweaking hyperparameters one at a time is to perform a grid search which iterates over every combination of hyperparameter values and allows us to assess which combination tends to perform well. To perform a manual grid search, first we want to construct our grid of hyperparameter combinations. We’re going to search across 81 models with varying learning rates and tree depth.
I also vary the minimum number of observations allowed in the trees' terminal nodes (n.minobsinnode) and introduce stochastic gradient descent by allowing bag.fraction < 1.
# create hyperparameter grid
hyper_grid <- expand.grid(
shrinkage = c(.01, .1, .3),
interaction.depth = c(1,3,5),
n.minobsinnode = c(5,10,15),
bag.fraction = c(.65, .8, 1),
optimal_trees = 0, # a place to dump results
min_RMSE = 0 # a place to dump results
)
# total number of combinations
nrow(hyper_grid)
## [1] 81
We loop through each hyperparameter combination and apply 5,000 trees. However, to speed up the tuning process, instead of performing 5-fold CV I train on 75% of the training observations, and evaluate performance on the remaining 25%.
Notes When using train.fraction it will take the first XX% of the data, so it's important to randomize your rows in case there is any logic behind the ordering of the data (i.e., ordered by neighborhood).
Our grid search revealed a few important attributes. First, our top model has better performance than our previously fitted model above and any of the other models covered in Chapters @ref(regularized-regression) and @ref(random-forest), with an RMSE of $20,390.55. Second, looking at the top 10 models we see that all of them use the deepest tree setting (interaction.depth = 5); there are likely some important interactions that the deeper trees are able to capture.
Notes Searching this entire grid took 36 minutes.
# randomize data
random_index <- sample(1:nrow(ames_train), nrow(ames_train))
random_train <- ames_train[random_index,]
# grid search
for (i in 1:nrow(hyper_grid)){
# reproducibility
set.seed(123)
# train model
gbm.tune <- gbm(
formula = Sale_Price ~ .,
distribution = "gaussian",
data = random_train,
n.trees = 5000,
interaction.depth = hyper_grid$interaction.depth[i],
shrinkage = hyper_grid$shrinkage[i],
n.minobsinnode = hyper_grid$n.minobsinnode[i],
bag.fraction = hyper_grid$bag.fraction[i],
train.fraction = .75,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
# add min training error and trees to grid
hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))
}
hyper_grid %>%
dplyr::arrange(min_RMSE) %>%
head(10)
## shrinkage interaction.depth n.minobsinnode bag.fraction optimal_trees min_RMSE
## 1 0.01 5 5 0.80 4911 20390.55
## 2 0.01 5 5 0.65 4726 20588.02
## 3 0.10 5 10 0.80 500 20758.72
## 4 0.01 5 10 0.80 4897 20761.53
## 5 0.01 5 5 1.00 5000 20997.68
## 6 0.10 5 5 1.00 665 21277.84
## 7 0.10 5 5 0.80 514 21277.90
## 8 0.01 5 10 0.65 4987 21310.94
## 9 0.01 5 15 0.80 4990 21456.17
## 10 0.01 5 10 1.00 4960 21481.65
These results help us to zoom into areas where we can refine our search. In practice, tuning is an iterative process, so you would likely refine this search grid to analyze a search space around the top models. For example, I would likely search the following values in my next grid search:
along with the previously assessed values for n.minobsinnode and bag.fraction. Also, since we used nearly 5,000 trees when the learning rate was 0.01, I would increase the number of trees to ensure there are enough for a learning rate \(= 0.005\).
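A follow-up grid might look like the following (the specific values are hypothetical refinements zoomed in around the top models above):

```r
# a refined (hypothetical) follow-up hyperparameter grid
hyper_grid2 <- expand.grid(
  shrinkage = c(.005, .01, .05),
  interaction.depth = c(4, 5, 6),
  n.minobsinnode = c(5, 10, 15),
  bag.fraction = c(.65, .8, 1)
)
nrow(hyper_grid2)  # number of combinations to assess
```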
Once we have found our top model, we train a final model with those specific parameters. I'll use the top model in our grid search, and since the model converged at 4,911 trees, I train a model with that many trees.
# for reproducibility
set.seed(123)
# train GBM model
gbm.fit.final <- gbm(
formula = Sale_Price~.,
distribution = "gaussian",
data = ames_train,
n.trees = 4911,
interaction.depth = 5,
shrinkage = 0.01,
n.minobsinnode = 5,
bag.fraction = .80,
train.fraction = 1,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
Similar to random forests, GBMs make no assumption regarding linearity or monotonicity of the predictor-response relationship. So, as we did in the random forest chapter (Chapter @ref(random-forest)), we can understand the relationship between the features and the response using variable importance plots and partial dependence plots.
Notes Additional model interpretability approaches will be discussed in the model interpretability chapter.
After re-running our final model, we likely want to understand the variables that have the largest influence on our response variable. The summary method for gbm will output a data frame and a plot that shows the most influential variables. cBars allows you to adjust the number of variables to show (in order of influence). The default method for computing variable importance is with relative influence, but your options include:
method = relative.influence: At each split in each tree, gbm computes the improvement in the split-criterion (MSE for regression). gbm then averages the improvement made by each variable across all the trees that the variable is used in. The variables with the largest average decrease in MSE are considered most important.
method = permutation.test.gbm: For each tree, the OOB sample is passed down the tree and the prediction accuracy is recorded. Then the values for each variable (one at a time) are randomly permuted and the accuracy is again computed. The decrease in accuracy as a result of this random shuffling of variable values is averaged over all the trees for each variable. The variables with the largest average decrease in accuracy are considered most important.
par(mfrow = c(1, 2), mar = c(5, 10, 1, 1))
# relative influence approach
summary(gbm.fit.final, cBars = 10, method = relative.influence, las = 2)
# permutation approach
summary(gbm.fit.final, cBars = 10, method = permutation.test.gbm, las = 2)
Top 10 influential variables using the relative influence (left) and permutation (right) approach. We can see common themes among the top variables although in differing order.
After the most relevant variables have been identified, we can use partial dependence plots (PDPs) and individual conditional expectation (ICE) curves to better understand the relationship between the predictors and response. Here, we plot two of the most influential variables (Gr_Liv_Area and Overall_Qual). We see that both predictors have non-linear relationships with the sale price.
Notes As in Chapter @ref(random-forest), you can produce ICE curves by incorporating ice = TRUE and center = TRUE (for centered ICE curves).
p1 <- gbm.fit.final %>%
partial(
pred.var = "Gr_Liv_Area",
n.trees = gbm.fit.final$n.trees,
grid.resolution = 50
) %>%
autoplot(rug = TRUE, train = ames_train)
p2 <- gbm.fit.final %>%
partial(
pred.var = "Overall_Qual",
n.trees = gbm.fit.final$n.trees,
train = data.frame(ames_train)
) %>%
autoplot()
gridExtra::grid.arrange(p1, p2, nrow = 1)
The mean predicted sale price as Gr_Liv_Area and Overall_Qual change in value.
Once you’ve found your optimal model, predicting new observations with the gbm model follows the same procedure as most R models. We simply use the predict function; however, we also need to supply the number of trees to use (see ?predict.gbm for details). We see that our RMSE for our test set is right in line with the optimal RMSE obtained during our grid search and far better than any model to-date.
# predict values for test data
pred <- predict(gbm.fit.final, n.trees = gbm.fit.final$n.trees, ames_test)
# results
caret::RMSE(pred, ames_test$Sale_Price)
## [1] 20859.01
The xgboost package only works with matrices that contain all numeric variables; consequently, we need to one-hot encode our data. Throughout this book, we've illustrated different ways to do this in R (i.e., Matrix::sparse.model.matrix, caret::dummyVars), but here we will use the vtreat package.
vtreat is a robust package for data prep and helps to eliminate problems caused by missing values, novel categorical levels that appear in future data sets that were not in the training data, etc. However, vtreat is not very intuitive. I will not explain the functionalities but you can find more information here, here, and here.
The following applies vtreat to one-hot encode the training and testing data sets.
# variable names
features <- setdiff(names(ames_train), "Sale_Price")
# create the treatment plan from the training data
treatplan <- vtreat::designTreatmentsZ(ames_train,features, verbose=FALSE)
# get the "clean" variable names from the scoreFrame
new_vars <- treatplan %>%
magrittr::use_series(scoreFrame) %>%
dplyr::filter(code %in% c("clean", "lev")) %>%
magrittr::use_series(varName)
# prepare the training data
features_train <- vtreat::prepare(treatplan, ames_train, varRestriction = new_vars) %>%
as.matrix()
response_train <- ames_train$Sale_Price
# prepare the test data
features_test <- vtreat::prepare(treatplan, ames_test, varRestriction = new_vars) %>% as.matrix()
response_test <- ames_test$Sale_Price
# dimensions of one-hot encoded data
dim(features_train)
## [1] 2054 345
dim(features_test)
## [1] 876 345
features_train %>%
dplyr::as_tibble() %>%
head()
xgboost provides different training functions (i.e., xgb.train, along with xgboost, which is a simple wrapper for xgb.train). However, to train an xgboost model while assessing performance, we typically want to use xgb.cv, which incorporates cross-validation. The following trains a basic 5-fold cross validated XGBoost model with 1,000 trees. There are many parameters available in xgb.cv, but the ones you have become more familiar with in this chapter include the following default values:
* learning rate (eta): 0.3
* tree depth (max_depth): 6
* minimum node size (min_child_weight): 1
* percent of training data to sample for each tree (subsample, equivalent to gbm's bag.fraction): 100%
Notes
This model took nearly 2 minutes to run. The reason xgboost seems slower than gbm is that we one-hot encoded our data, so xgboost is searching across 345 features, whereas gbm used the original (non-encoded) data and was only searching across 80 features.
# reproducibility
set.seed(123)
xgb.fit1 <- xgb.cv(
data = features_train,
label = response_train,
nrounds = 1000,
nfold = 5,
objective = "reg:linear", # for regression models
verbose = 0 # silent
)
The xgb.fit1 object contains lots of good information. In particular, we can assess xgb.fit1$evaluation_log to identify the minimum RMSE and the optimal number of trees for both the training data and the cross-validated error. We can see that the training error continues decreasing through 980 trees, where the RMSE nearly reaches 0; however, the cross-validated error reaches a minimum RMSE of $26,758.30 with only 98 trees.
# get number of trees that minimize error
xgb.fit1$evaluation_log %>%
dplyr::summarise(
ntrees.train = which.min(train_rmse_mean),
rmse.train = min(train_rmse_mean),
ntrees.test = which.min(test_rmse_mean),
rmse.test = min(test_rmse_mean)
)
## ntrees.train rmse.train ntrees.test rmse.test
## 1 980 0.05009 98 26758.3
# plot error vs number trees
xgb.fit1$evaluation_log %>%
tidyr::gather(error, RMSE, train_rmse_mean, test_rmse_mean) %>%
ggplot(aes(iter,RMSE, color=error))+
geom_line()
Training (blue) and cross-validation (red) error for each additional tree added to the GBM algorithm. The CV error is quickly minimized with 98 trees while the training error reduces to near zero over 980 trees.
A nice feature provided by xgb.cv is early stopping. This allows us to tell the function to stop running if the cross-validated error does not improve for n consecutive trees. For example, the model above could be re-run with the following, where we tell it to stop if we see no improvement for 10 consecutive trees. This feature will help us speed up the tuning process in the next section.
Notes This reduced our training time from 2 minutes to 8 seconds!
# reproducibility
set.seed(123)
xgb.fit2 <- xgb.cv(
data = features_train,
label = response_train,
nrounds = 1000,
nfold = 5,
objective = "reg:linear", #for regression models
verbose = 0, # silent,
early_stopping_rounds = 10 # stop if no improvement for 10 consecutive trees
)
# plot error vs number trees
xgb.fit2$evaluation_log %>%
tidyr::gather(error, RMSE, train_rmse_mean, test_rmse_mean) %>%
ggplot(aes(iter,RMSE, color=error))+
geom_line()
Early stopping allows us to stop training once we experience no additional improvement on our cross-validated error.
To tune the XGBoost model, we pass parameters as a list object to the params argument. The most common parameters include:
* eta: controls the learning rate
* max_depth: tree depth
* min_child_weight: minimum number of observations required in each terminal node
* subsample: percent of training data to sample for each tree
* colsample_bytree: percent of columns to sample from for each tree
For example, if we wanted to specify values for these parameters, we would extend the above model with the following parameters.
# create parameter list
params <- list(
eta = .1,
max_depth = 5,
min_child_weight = 2,
subsample = .8,
colsample_bytree = .9
)
# reproducibility
set.seed(123)
# train model
xgb.fit3 <- xgb.cv(
params = params,
data = features_train,
label = response_train,
nrounds = 1000,
nfold = 5,
objective = "reg:linear", # for regression models
verbose = 0, # silent
early_stopping_rounds = 10 # stop if no improvement for 10 consecutive trees
)
# assess results
xgb.fit3$evaluation_log %>%
dplyr::summarise(
ntrees.train = which.min(train_rmse_mean),
rmse.train = min(train_rmse_mean),
ntrees.test = which.min(test_rmse_mean),
rmse.test = min(test_rmse_mean),
)
## ntrees.train rmse.train ntrees.test rmse.test
## 1 122 7954.668 112 24547.28
To perform a large search grid, we can follow the same procedure we did with gbm. We create our hyperparameter search grid along with columns to dump our results in. Here, I created a pretty large search grid consisting of 108 different hyperparameter combinations to model.
# create hyperparameter grid
hyper_grid <- expand.grid(
eta = c(.05, .1, .15),
max_depth = c(3, 5, 7),
min_child_weight = c(5, 10, 15),
subsample = c(.65, .8),
colsample_bytree = c(.9, 1),
optimal_trees = 0, # a place to dump results
min_RMSE = 0 # a place to dump results
)
nrow(hyper_grid)
## [1] 108
Now I apply the same for loop procedure to loop through and apply an xgboost model for each hyperparameter combination and dump the results in the hyper_grid data frame. Our minimum RMSE ($23,316.40) is a little higher than the gbm model, likely a result of one-hot encoding our data and how the models treat these dummy coded variables differently.
Notes This full search grid took 34 minutes to run!
# grid search
for (i in 1:nrow(hyper_grid)){
# create parameter list
params <- list(
eta = hyper_grid$eta[i],
max_depth = hyper_grid$max_depth[i],
min_child_weight = hyper_grid$min_child_weight[i],
subsample = hyper_grid$subsample[i],
colsample_bytree = hyper_grid$colsample_bytree[i]
)
# reproducibility
set.seed(123)
# train model
xgb.tune <- xgb.cv(
params = params,
data = features_train,
label = response_train,
nrounds = 5000,
nfold = 5,
objective = "reg:linear", # for regression models
verbose = 0, # silent,
early_stopping_rounds = 10 # stop if no improvement for 10 consecutive trees
)
# add min training error and trees to grid
hyper_grid$optimal_trees[i] <- which.min(xgb.tune$evaluation_log$test_rmse_mean)
hyper_grid$min_RMSE[i] <- min(xgb.tune$evaluation_log$test_rmse_mean)
}
hyper_grid %>%
dplyr::arrange(min_RMSE) %>%
head(10)
## eta max_depth min_child_weight subsample colsample_bytree optimal_trees min_RMSE
## 1 0.05 7 5 0.65 1.0 481 23316.40
## 2 0.05 5 5 0.65 1.0 355 23515.42
## 3 0.05 5 5 0.80 1.0 469 23856.31
## 4 0.05 5 5 0.65 0.9 312 23888.34
## 5 0.05 7 5 0.80 1.0 660 23904.30
## 6 0.15 3 5 0.65 1.0 227 23909.78
## 7 0.05 3 5 0.80 1.0 567 23967.56
## 8 0.05 3 10 0.80 1.0 665 23998.05
## 9 0.15 3 5 0.65 0.9 230 24001.86
## 10 0.05 5 10 0.80 1.0 624 24033.78
After assessing the results, you would likely perform a few more grid searches to hone in on the parameters that appear to influence the model the most. In fact, here is a link to a great blog post that discusses a strategic approach to tuning with xgboost. However, for brevity, we'll just assume the top model in the above search is the globally optimal model. Once you've found the optimal model, we can fit our final model with xgb.train or xgboost.
# parameter list
params <- list(
eta = 0.05,
max_depth = 7,
min_child_weight = 5,
subsample = 0.65,
colsample_bytree = 1
)
# train final model
xgb.fit.final <- xgboost(
params = params,
data = features_train,
label = response_train,
nrounds = 481,
objective = "reg:linear",
verbose = 0
)
xgboost provides built-in variable importance plotting. First, you need to create the importance matrix with xgb.importance and then feed this matrix into xgb.plot.importance (or xgb.ggplot.importance for ggplot output). xgboost provides 3 variable importance measures:
Gain: the relative contribution of the corresponding feature to the model, calculated by taking each feature's contribution for each tree in the model. This is synonymous with gbm's relative.influence.
Cover: the relative number of observations related to each feature.
Frequency: the percentage of times a feature is used across all trees.
Notes The xgb.ggplot.importance plotting mechanism will also perform a cluster analysis on the features based on their importance scores. This becomes more useful when visualizing many features (i.e., 50+) and you want to categorize them based on their importance.
# create importance matrix
importance_matrix <- xgb.importance(model=xgb.fit.final)
# variable importance plot
p1 <- xgb.ggplot.importance(importance_matrix, top_n = 10, measure = "Gain") + ggtitle("Gain") + theme(legend.position="bottom")
p2 <- xgb.ggplot.importance(importance_matrix, top_n = 10, measure = "Cover") + ggtitle("Cover") + theme(legend.position="bottom")
p3 <- xgb.ggplot.importance(importance_matrix, top_n = 10, measure = "Frequency") + ggtitle("Frequency") + theme(legend.position="bottom")
gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
Top 10 influential variables for our final xgboost model based on the three different variable importance metrics.
PDP and ICE plots work similarly to how we implemented them with gbm. The only difference is you need to incorporate the training data within the partial function since these data cannot be extracted directly from the model object. We see a similar non-linear relationship between Gr_Liv_Area and predicted sale price as we did with gbm and in the random forest models; however, note the unique dip right after Gr_Liv_Area reaches 3,000 square feet. We saw this dip in the gbm model; however, it is a pattern that was not picked up by the random forest models.
Notes You do not need to supply the number of trees with n.trees = xgb.fit.final$niter; however, when supplying a cross-validated model where the optimal number of trees may be less than the total number of trees run, you will want to supply the optimal number of trees to the n.trees parameter.
pdp <- xgb.fit.final %>%
partial(
pred.var = "Gr_Liv_Area_clean",
n.trees = xgb.fit.final$niter,
grid.resolution = 50,
train = features_train
) %>%
autoplot(rug = TRUE, train = features_train) +
ggtitle("PDP")
ice <- xgb.fit.final %>%
partial(
pred.var = "Gr_Liv_Area_clean",
n.trees = xgb.fit.final$niter,
grid.resolution = 100,
train = features_train,
ice = TRUE
) %>%
autoplot(rug = TRUE, train = features_train, alpha = .05, center = TRUE) +
ggtitle("ICE")
gridExtra::grid.arrange(pdp, ice, nrow = 1)
Lastly, we use predict to predict on new observations; however, unlike gbm we do not need to provide the number of trees.
# predict values for test data
pred <- predict(xgb.fit.final, features_test)
# test set results
caret::RMSE(pred, response_test)
## [1] 23277.69
Let's go ahead and start up h2o:
h2o.no_progress()
h2o.init(max_mem_size = "5g")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 21 minutes 47 seconds
## H2O cluster timezone: Asia/Tokyo
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.8
## H2O cluster version age: 3 months and 24 days !!!
## H2O cluster name: H2O_started_from_R_kojikm.mizumura_vuw540
## H2O cluster total nodes: 1
## H2O cluster total memory: 5.00 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.0 (2018-04-23)
h2o.gbm allows us to perform a GBM with H2O. However, prior to running our initial model we need to convert our training data to an h2o object. By default, h2o.gbm applies a GBM model with the following parameters:
* ntrees: 50
* learn_rate: 0.1
* max_depth: 5
* min_rows: 10
Since we are performing a 5-fold cross-validation, the output reports results for both our training set (\(RMSE = 12539.86\)) and the cross-validation holdouts (\(RMSE = 24654.047\)).
# create feature names
y <- "Sale_Price"
x <- setdiff(names(ames_train), y)
# turn training set into h2o object
train.h2o <- as.h2o(ames_train)
# training basic GBM model with defaults
h2o.fit1 <- h2o.gbm(
x = x,
y = y,
training_frame = train.h2o,
nfolds = 5 # performs 5 fold cross validation
)
# assess model results
h2o.fit1
## Model Details:
## ==============
##
## H2ORegressionModel: gbm
## Model ID: GBM_model_R_1533927247702_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 50 50 17591 5 5 5.00000 9 31 22.96000
##
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 157248046
## RMSE: 12539.86
## MAE: 8988.278
## RMSLE: 0.08190755
## Mean Residual Deviance : 157248046
##
##
##
## H2ORegressionMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 612660925
## RMSE: 24751.99
## MAE: 15485.9
## RMSLE: 0.1369165
## Mean Residual Deviance : 612660925
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
## mae 15465.6455 403.8833 15899.954 14460.245 16051.8125 15667.94 15248.277
## mean_residual_deviance 6.1167661E8 6.4985276E7 6.851456E8 4.3496512E8 6.0995507E8 6.7103411E8 6.5728301E8
## mse 6.1167661E8 6.4985276E7 6.851456E8 4.3496512E8 6.0995507E8 6.7103411E8 6.5728301E8
## r2 0.9020621 0.008838167 0.8946199 0.9245291 0.89047533 0.8939862 0.90669996
## residual_deviance 6.1167661E8 6.4985276E7 6.851456E8 4.3496512E8 6.0995507E8 6.7103411E8 6.5728301E8
## rmse 24654.047 1388.2732 26175.287 20855.818 24697.27 25904.326 25637.531
## rmsle 0.13682812 0.009084022 0.15625001 0.12442712 0.12835869 0.12703055 0.14807428
Similar to xgboost, we can incorporate automated stopping so that we can crank up the number of trees but terminate training once model improvement decreases or stops. There is also an option to terminate training after a given amount of time has passed (see max_runtime_secs). In this example, I train a default model with 5,000 trees but stop training once 10 consecutive trees show no improvement in the cross-validated error. In this case, training stops after 2,623 trees and yields a cross-validated RMSE of $23,441.46.
# training basic GBM model with defaults
h2o.fit2 <- h2o.gbm(
x = x,
y = y,
training_frame = train.h2o,
nfolds = 5,
ntrees = 5000,
stopping_rounds = 10,
stopping_tolerance = 0,
seed=123
)
# model stopped after XX trees
h2o.fit2@parameters$ntrees
## [1] 2623
# cross validated RMSE
h2o.rmse(h2o.fit2, xval = TRUE)
## [1] 23441.46
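To illustrate the time-based option mentioned above, the same model can also be capped with max_runtime_secs. This is a minimal sketch (not run here), assuming the same x, y, and train.h2o objects; the cap value of 10 minutes is an arbitrary example:

```r
# same early-stopping model as above, but also capped at 10 minutes of runtime;
# whichever criterion triggers first (no improvement for 10 rounds, 5,000
# trees, or the time cap) ends training
h2o.fit2_capped <- h2o.gbm(
  x = x,
  y = y,
  training_frame = train.h2o,
  nfolds = 5,
  ntrees = 5000,
  stopping_rounds = 10,
  stopping_tolerance = 0,
  max_runtime_secs = 60 * 10,  # hard time cap in seconds
  seed = 123
)
```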
H2O provides many parameters that can be adjusted. It is well worth your time to check out the available documentation at H2O.ai. For this chapter, we will focus on the more commonly adjusted hyperparameters. These include:
* Tree complexity:
  * ntrees: number of trees to train
  * max_depth: depth of each tree
  * min_rows: fewest observations allowed in a terminal node
* Learning rate:
  * learn_rate: rate to descend the loss function gradient
  * learn_rate_annealing: allows you to have a high initial learn_rate, then gradually reduce it as trees are added (speeds up training)
* Adding stochastic nature:
  * sample_rate: row sample rate per tree
  * col_sample_rate: column sample rate per tree (synonymous with xgboost's colsample_bytree)
Note that there are also parameters that control how categorical and continuous variables are encoded, binned, and split. The defaults tend to perform quite well, but we have been able to gain small improvements in certain circumstances by adjusting these. We will not cover them here, but they are worth reviewing.
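As a pointer only, a few of these knobs appear directly as arguments to h2o.gbm. The sketch below (not tuned, values chosen purely for illustration) assumes the same x, y, and train.h2o objects:

```r
# sketch: adjusting binning/encoding behavior (example values, not tuned)
h2o.fit_bins <- h2o.gbm(
  x = x,
  y = y,
  training_frame = train.h2o,
  nbins = 30,                    # histogram bins for continuous splits
  nbins_cats = 2048,             # bins for categorical splits
  categorical_encoding = "Enum", # how categorical columns are encoded
  nfolds = 5,
  seed = 123
)
```

Whether these changes help is data dependent; treat them as a secondary tuning pass after the main hyperparameters are settled.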
To perform grid search tuning with H2O we can perform either a full or random discrete grid search as discussed in Section @ref(rf-h2o-regression-tune).
We'll start with a full grid search. However, to speed up training with h2o I will use a validation set rather than perform k-fold cross-validation. The following creates a hyperparameter grid consisting of 216 hyperparameter combinations. We apply h2o.grid to perform a grid search while also incorporating stopping parameters to reduce training time.
We see that the worst model had an RMSE of $34,142.31 (\(\sqrt{1165697554}\)) and the best had an RMSE of $21,822.66 (\(\sqrt{476228672}\)). A few characteristics pop out when we assess the results: models that sample a subset of columns, include more iterations via deeper trees, and allow terminal nodes with single observations tend to perform best.
Note: this full grid search took 18 minutes.
# create training & validation sets
split <- h2o.splitFrame(train.h2o, ratios = 0.75)
train <- split[[1]]
valid <- split[[2]]
# create hyperparameter grid
hyper_grid <- list(
max_depth = c(1,3,5),
min_rows = c(1,5,10),
learn_rate = c(0.05, 0.1, 0.15),
learn_rate_annealing = c(.99, 1),
sample_rate = c(.75,1),
col_sample_rate = c(.9, 1)
)
# perform grid search
grid <- h2o.grid(
algorithm = "gbm",
grid_id = "gbm_grid1",
x = x,
y = y,
training_frame = train,
validation_frame = valid,
hyper_params = hyper_grid,
ntrees = 5000,
stopping_rounds = 10,
stopping_tolerance = 0,
seed = 123
)
# collect the results and sort by our model performance metric of choice
grid_perf <- h2o.getGrid(
grid_id = "gbm_grid1",
sort_by = "mse",
decreasing = FALSE
)
grid_perf
## H2O Grid Details
## ================
##
## Grid ID: gbm_grid1
## Used hyper parameters:
## - col_sample_rate
## - learn_rate
## - learn_rate_annealing
## - max_depth
## - min_rows
## - sample_rate
## Number of models: 216
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by increasing mse
## col_sample_rate learn_rate learn_rate_annealing max_depth min_rows sample_rate model_ids mse
## 1 0.9 0.05 1.0 5 1.0 1.0 gbm_grid1_model_138 4.762286724006572E8
## 2 0.9 0.05 1.0 5 1.0 0.75 gbm_grid1_model_30 4.9950455156501746E8
## 3 0.9 0.1 0.99 5 1.0 1.0 gbm_grid1_model_134 5.04407597702603E8
## 4 0.9 0.15 1.0 5 1.0 1.0 gbm_grid1_model_142 5.1283047945094573E8
## 5 0.9 0.05 0.99 5 1.0 1.0 gbm_grid1_model_132 5.1307290689450026E8
##
## ---
## col_sample_rate learn_rate learn_rate_annealing max_depth min_rows sample_rate model_ids mse
## 211 1.0 0.05 0.99 1 5.0 0.75 gbm_grid1_model_37 1.1602102219725711E9
## 212 1.0 0.05 0.99 1 10.0 0.75 gbm_grid1_model_73 1.1620871092432442E9
## 213 0.9 0.05 0.99 1 5.0 1.0 gbm_grid1_model_144 1.1631655136872623E9
## 214 1.0 0.05 0.99 1 5.0 1.0 gbm_grid1_model_145 1.164083893251335E9
## 215 0.9 0.05 0.99 1 10.0 1.0 gbm_grid1_model_180 1.1653298042199028E9
## 216 1.0 0.05 0.99 1 10.0 1.0 gbm_grid1_model_181 1.1656975535614166E9
We can check out more details of the best performing model. We can see that our best model used all 5,000 trees; thus, a future grid search may want to increase the number of trees.
# get the model_id for the top model, chosen by validation error
best_model_id <- grid_perf@model_ids[[1]]
best_model <- h2o.getModel(best_model_id)
best_model@parameters$ntrees
## [1] 5000
# Now let's get performance metrics on the best model
h2o.performance(model = best_model, valid = TRUE)
## H2ORegressionMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 476228672
## RMSE: 21822.66
## MAE: 13972.88
## RMSLE: 0.1341252
## Mean Residual Deviance : 476228672
As discussed in Section @ref(rf-h2o-regression-tune-random), h2o also allows us to perform a random grid search with early stopping. We can build onto the previous results and perform a random discrete grid search.
This time, I increase max_depth, refine min_rows, and allow for an 80% col_sample_rate option. I keep all model stopping criteria the same. I also add search criteria that stop the grid search if none of the last 10 models have managed a 0.5% improvement in MSE compared to the best model before that. If we continue to find improvements, the grid search is cut off after 1,800 seconds (30 minutes). In this example, our search ran for the entire 30 minutes and evaluated 68 of the 216 potential models.
In the body of the grid search, notice that I increase the trees to 10,000 since our best model used all 5,000 but I also include a stopping mechanism so that the model quits adding trees once no improvement is made in the validation RMSE.
Note: this random grid search took 30 minutes. It only assessed about a third of the models the full grid search did, but keep in mind that it was assessing models with very low learning rates, which can take a long time to train.
# refined hyperparameter grid
hyper_grid <- list(
max_depth = c(5, 7, 9),
min_rows = c(1, 3, 5),
learn_rate = c(0.05, 0.1, 0.15),
learn_rate_annealing = c(.99, 1),
sample_rate = c(.75, 1),
col_sample_rate = c(.8, .9)
)
# random grid search criteria
search_criteria <- list(
strategy = "RandomDiscrete",
stopping_metric = "mse",
stopping_tolerance = 0.005,
stopping_rounds = 10,
max_runtime_secs = 60*30
)
# perform grid search
grid <- h2o.grid(
algorithm = "gbm",
grid_id = "gbm_grid2",
x = x,
y = y,
training_frame = train,
validation_frame = valid,
hyper_params = hyper_grid,
search_criteria = search_criteria, # add search criteria
ntrees = 10000,
stopping_rounds = 10,
stopping_tolerance = 0,
seed = 123
)
# collect the results and sort by our model performance metric of choice
grid_perf <- h2o.getGrid(
grid_id = "gbm_grid2",
sort_by = "mse",
decreasing = FALSE
)
grid_perf
## H2O Grid Details
## ================
##
## Grid ID: gbm_grid2
## Used hyper parameters:
## - col_sample_rate
## - learn_rate
## - learn_rate_annealing
## - max_depth
## - min_rows
## - sample_rate
## Number of models: 68
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by increasing mse
## col_sample_rate learn_rate learn_rate_annealing max_depth min_rows sample_rate model_ids mse
## 1 0.9 0.05 1.0 5 1.0 1.0 gbm_grid2_model_50 4.7624213765845805E8
## 2 0.8 0.1 0.99 5 5.0 0.75 gbm_grid2_model_40 4.9473498408073145E8
## 3 0.8 0.1 0.99 5 1.0 0.75 gbm_grid2_model_34 4.989151160193848E8
## 4 0.9 0.05 1.0 5 1.0 0.75 gbm_grid2_model_2 4.99534198008744E8
## 5 0.9 0.1 0.99 7 1.0 1.0 gbm_grid2_model_46 5.1068183500740755E8
##
## ---
## col_sample_rate learn_rate learn_rate_annealing max_depth min_rows sample_rate model_ids mse
## 63 0.9 0.1 0.99 9 1.0 1.0 gbm_grid2_model_19 6.888828473137908E8
## 64 0.9 0.05 1.0 9 5.0 1.0 gbm_grid2_model_66 6.978688028588266E8
## 65 0.9 0.05 0.99 9 3.0 1.0 gbm_grid2_model_53 7.120703549234108E8
## 66 0.8 0.15 1.0 9 1.0 0.75 gbm_grid2_model_64 7.175292869435712E8
## 67 0.9 0.1 0.99 9 3.0 1.0 gbm_grid2_model_3 7.793274661905156E8
## 68 0.8 0.15 1.0 7 5.0 0.75 gbm_grid2_model_67 8.279874389570463E8
In this example, the best model obtained a validation RMSE of $21,822.97. So although we assessed only 31% of the total models, we found a model that performed virtually the same as the best model from our initial full grid search.
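As with the full grid, the random grid's top model can be extracted and evaluated on the validation frame. A short sketch mirroring the earlier steps, assuming grid_perf (from gbm_grid2) and valid are still in the session:

```r
# get the model_id for the top model, chosen by validation error
best_model_id <- grid_perf@model_ids[[1]]
best_model <- h2o.getModel(best_model_id)

# performance metrics for the top model on the validation frame
h2o.performance(model = best_model, valid = TRUE)
```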